At South by Southwest 2017, Watson Data Platform developer advocates presented a demonstration app that allowed conference attendees to quickly and contextually find those events of greatest interest to them. The app is described in this blog post.
This data science notebook analyzes the log of all user interactions in the application to help understand how it was used, and what topics were of greatest interest to users.
We're going to use PixieDust to help visualize our data. You can learn more about PixieDust at https://ibm-cds-labs.github.io/pixiedust/. In the following cell we ensure we are running the latest version of PixieDust. Be sure to restart your kernel if instructed to do so.
In [ ]:
!pip install --user --upgrade pixiedust
In [ ]:
import pixiedust
pixiedust.enableJobMonitor()
In [ ]:
from pyspark.sql.functions import explode, lower
In [ ]:
# Enter your Cloudant host name
host = 'opendata.cloudant.com'
# Enter your Cloudant user name
username = ''
# Enter your Cloudant password
password = ''
# Enter your source database name
database = 'sxswlog'
In [ ]:
# no changes are required to this cell
from pyspark.sql import SparkSession

# obtain the SparkSession
sparkSession = SparkSession.builder.getOrCreate()

# load data from Cloudant, authenticating only if a username was provided
if username:
    conversation_df = sparkSession.read.format("com.cloudant.spark").\
        option("cloudant.host", host).\
        option("cloudant.username", username).\
        option("cloudant.password", password).\
        load(database)
else:
    conversation_df = sparkSession.read.format("com.cloudant.spark").\
        option("cloudant.host", host).\
        load(database)
Each document in the database represents a single conversation made with the chatbot. Each conversation includes the user, date, and the steps of the conversation. The steps are stored in an array called dialogs (referring to the dialogs in Watson Conversation that were traversed as part of the conversation). Here is a sample conversation:
"_id": "018885a1fb6cf6dbb49a8e11542e7670",
"_rev": "1-02239161bbfcbae37f5e85c43225fd4b",
"user": "phoneeb14851fc4c343e1b5dd96c6ed9e3748",
"date": 1489109308136,
"dialogs": [
{
"name": "get_music_topic",
"message": "Music",
"date": 1489343583979
},
{
"name": "search_music_topic",
"message": "Brass bands",
"date": 1489343600650
}
]
In this particular conversation the user first told the chatbot they would like to search for music gigs by sending the message "Music" to the chatbot. The user then asked the chatbot to search for "Brass bands".
In the following cell we print the schema to confirm the structure of the documents.
In [ ]:
conversation_df.printSchema()
In [ ]:
conversation_df.count()
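Before digging further in with Spark, the document structure described above can be sanity-checked in plain Python. The following sketch (no Spark required) walks the sample conversation shown earlier and pairs each action name with the user message that triggered it:

```python
# A pure-Python sketch of walking one conversation document.
# The sample data is copied from the example conversation above.
conversation = {
    "_id": "018885a1fb6cf6dbb49a8e11542e7670",
    "user": "phoneeb14851fc4c343e1b5dd96c6ed9e3748",
    "date": 1489109308136,
    "dialogs": [
        {"name": "get_music_topic", "message": "Music", "date": 1489343583979},
        {"name": "search_music_topic", "message": "Brass bands", "date": 1489343600650},
    ],
}

# Pair each action name with the message that triggered it.
steps = [(d["name"], d["message"]) for d in conversation["dialogs"]]
```

Here `steps` recovers the two-turn exchange described above: the user chose the "Music" topic, then searched for "Brass bands".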
At SXSW we demonstrated the chatbot on a laptop and display, and also gave users the ability to run the chatbot from their own phones, either via SMS or via a mobile-optimized version of the web app. When the chatbot ran from the laptop we stored the user value as "web" plus a UUID; when it ran from a user's phone we stored the user value as "phone" plus a UUID. We are most interested in the conversations held by users who installed the chatbot on their own phones, so here we filter down to only those conversations:
In [ ]:
# keep only conversations from users' own phones ("phone" + uuid)
phone_conversation_df = conversation_df.filter('user LIKE "phone%"')
phone_conversation_df.select('user').distinct().count()
In [ ]:
# keep only conversations where the user went beyond the opening message
phone_conversation_df = phone_conversation_df.filter('size(dialogs) > 1')
phone_conversation_df.count()
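The two filters above, the `LIKE "phone%"` prefix match and the `size(dialogs) > 1` length check, can be mirrored in plain Python. This sketch uses made-up sample rows (the user IDs and messages are hypothetical, shortened for illustration):

```python
# Pure-Python mirror of the two Spark filters above, using made-up rows.
conversations = [
    {"user": "web1a2b",   "dialogs": [{"message": "Music"}]},
    {"user": "phone3c4d", "dialogs": [{"message": "Music"}, {"message": "Brass bands"}]},
    {"user": "phone5e6f", "dialogs": [{"message": "Film"}]},
]

# user LIKE "phone%"  ->  prefix match on the user value
phone = [c for c in conversations if c["user"].startswith("phone")]

# size(dialogs) > 1  ->  the user went beyond the opening message
engaged = [c for c in phone if len(c["dialogs"]) > 1]
```

Only the second row survives both filters: it came from a phone and its conversation has more than one dialog step.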
Each dialog contains a message field, which holds the message sent by the user, and a name field, which identifies the action the system performed based on that message and the current dialog in the conversation, as managed by Watson Conversation. For example, the name search_topic maps to the action of searching for Interactive sessions, and search_film maps to the action of searching for film screenings. We want to analyze specific actions and the messages associated with them, so in the next cell we convert each row (which holds the dialogs array) into multiple rows, one per dialog. This will make it easier for us to filter and aggregate on the message and name fields.
In [ ]:
# flatten each conversation into one row per dialog step
phone_dialog_df = phone_conversation_df.select(explode(phone_conversation_df.dialogs).alias("dialog"))
# keep the date, the lowercased user message, and the action name
phone_dialog_df = phone_dialog_df.select("dialog.date",
                                         lower(phone_dialog_df.dialog.message).alias("message"),
                                         "dialog.name")
phone_dialog_df.printSchema()
In [ ]:
display(phone_dialog_df)
In [ ]:
# keep only the dialogs where the user searched for Interactive sessions
interactive_dialog_df = phone_dialog_df.filter(phone_dialog_df.name == 'search_topic')
interactive_dialog_df.count()
Next we group by message, the text sent by the user. In this case it essentially represents the search term the user entered to find Interactive sessions. Here we aggregate and display the search terms across all users:
In [ ]:
interactive_dialog_by_message_df = interactive_dialog_df.groupBy('message').count().orderBy('count', ascending=False)
display(interactive_dialog_by_message_df)
In [ ]:
display(interactive_dialog_by_message_df.limit(20))
In [ ]:
display(interactive_dialog_by_message_df.limit(10))
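The `groupBy('message').count().orderBy('count', ascending=False)` pipeline above is equivalent to counting message frequencies and sorting by count. As a plain-Python illustration (the search terms here are made up for the example, not taken from the log):

```python
from collections import Counter

# Pure-Python equivalent of groupBy('message').count().orderBy('count',
# ascending=False), applied to a few hypothetical search terms.
messages = ["ai", "vr", "ai", "design", "ai", "vr"]
counts = Counter(messages).most_common()
```

`most_common()` returns (message, count) pairs sorted by descending count, matching the ordering we asked Spark for.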
It's important to remember that our user population does not represent SXSW attendees in general, because the only people introduced to our app were those who chose to visit IBM's installation that week and also chose to stop by our booth there. What we can say is that SXSW attendees interested in IBM technology innovation have an overwhelming interest in artificial intelligence and virtual reality, and a lesser but still significant interest in design, data, health, and social media.